4. Interpreting Linear Models

  • model tuning and stability

  • interpretability

  • application to case study

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.
  2. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Tuning

Coefficient Paths

  • \(\lambda\) is a hyperparameter that controls model complexity.
  • As \(\lambda \uparrow \infty\), all coefficients shrink toward zero. As \(\lambda \downarrow 0\), we return to linear regression.
  • We can study the order in which variables enter the model as we gradually decrease \(\lambda\). We expect that the most important features enter first.
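
The entry order described above can be read off a fitted path. The following is a minimal sketch, assuming the glmnet package is installed; the simulated data, the true coefficients, and the helper name `entry_lambda` are illustrative assumptions, not part of the case study.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
# Hypothetical truth: only the first three features carry signal.
beta <- c(3, -2, 1, rep(0, p - 3))
y <- drop(X %*% beta + rnorm(n))

# alpha = 1 requests the lasso; glmnet chooses a decreasing lambda sequence.
fit <- glmnet(X, y, alpha = 1)

# Coefficient paths plotted against log(lambda).
plot(fit, xvar = "lambda", label = TRUE)

# Largest lambda at which each feature has a nonzero coefficient
# (NA if the feature never enters along the fitted path).
nonzero <- as.matrix(fit$beta) != 0
entry_lambda <- apply(nonzero, 1, function(nz) {
  if (any(nz)) fit$lambda[which(nz)[1]] else NA_real_
})
head(sort(entry_lambda, decreasing = TRUE), 5)
```

Features with the largest `entry_lambda` are the first to leave zero as \(\lambda\) decreases.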

Coefficient Paths

Q: Can you tell if these coefficient paths \(\hat{\beta}_{j}\left(\lambda\right)\) come from ridge vs. lasso regression?

Choosing \(\lambda\)

  • What is an appropriate “budget” for model complexity?

  • Decreasing \(\lambda\) allows many noise features into the model. Increasing \(\lambda\) misses true signals.

  • We need to tune \(\lambda\) to balance these competing issues and achieve good performance on new samples.

Cross-validation

Split the data into \(K\) folds (e.g., \(K = 5\)). Fit the model on \(K - 1\) folds and test it on the remaining fold, rotating so that each fold is held out once. This mimics the setting of gathering new data and testing the model on that data.

Evaluation criteria

We can compute performance on the holdout folds across \(\lambda\) complexity parameters. For each \(k = 1, \dots, K\), we compute

\[\begin{align*} CV_{k}\left(\lambda\right) = \frac{1}{\left|I_{k}\right|} \sum_{i \in I_{k}} \left(y_{i} - \hat{y}_{i}^{-k}\left(\lambda\right)\right)^2 \end{align*}\] where \[\begin{align*} \hat{y}_{i}^{-k}\left(\lambda\right) := x_{i}^\top \hat{\beta}^{-k}\left(\lambda\right) \end{align*}\] and \(\hat{\beta}^{-k}\left(\lambda\right)\) solves the lasso optimization at hyperparameter \(\lambda\) using data from all folds except \(I_{k}\).
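
To make the definition of \(CV_{k}\left(\lambda\right)\) concrete, here is a minimal sketch that computes it by hand. It assumes glmnet is installed; the simulated data, the grid `lambda_grid`, and the random fold assignment are illustrative assumptions.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 100; p <- 20; K <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))  # hypothetical signal

lambda_grid <- exp(seq(log(2), log(0.01), length.out = 30))  # decreasing grid
fold <- sample(rep(seq_len(K), length.out = n))  # fold label for each sample i

cv <- matrix(NA_real_, K, length(lambda_grid))
for (k in seq_len(K)) {
  held_out <- fold == k  # indicator of I_k
  # beta^{-k}(lambda): lasso fit using all folds except I_k
  fit_k <- glmnet(X[!held_out, ], y[!held_out], alpha = 1, lambda = lambda_grid)
  # y_hat_i^{-k}(lambda) for i in I_k, one column per lambda
  y_hat <- predict(fit_k, newx = X[held_out, ], s = lambda_grid)
  # CV_k(lambda): mean squared error on the held-out fold
  cv[k, ] <- colMeans((y[held_out] - y_hat)^2)
}

cv_mean <- colMeans(cv)          # average over the K folds
lambda_grid[which.min(cv_mean)]  # lambda with the smallest average error
```

In practice `cv.glmnet` carries out this loop for us; the manual version is only meant to show what each symbol in the formula refers to.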

Evaluation criteria

Reading the cross-validation curve from small to large \(\lambda\): the initial decrease in error is the phase where removing noise features improves performance. The later increase happens when the penalty starts to exclude real signal features.

Hyperparameter Selection

  • \(\lambda_{\text{min}}\): Minimizes the cross-validation error.

  • \(\lambda_{\text{1se}}\): The simplest model, i.e., largest \(\lambda\), whose error is within one standard error of the minimum. This is a sparser model with comparable performance to the best.
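
Both choices are stored on the object returned by `cv.glmnet`. The sketch below, on simulated data that is an assumption for illustration, compares how many features each choice keeps; `n_selected` is a hypothetical helper.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))  # hypothetical signal

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)
plot(cv_fit)  # CV curve with error bars; dotted lines mark the two choices

c(lambda_min = cv_fit$lambda.min, lambda_1se = cv_fit$lambda.1se)

# Number of nonzero coefficients (excluding the intercept) at a given choice.
n_selected <- function(s) sum(as.matrix(coef(cv_fit, s = s))[-1, 1] != 0)
c(at_min = n_selected("lambda.min"), at_1se = n_selected("lambda.1se"))
```

Because \(\lambda_{\text{1se}} \geq \lambda_{\text{min}}\) by construction, the second model is typically the sparser of the two.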

Stability

Bootstrap

  1. Any statistical analysis depends on the particular data observed. Different data would yield different results.

  2. The bootstrap estimates this variability without collecting new data.

  3. In regularized regression, we quantify uncertainty in both predictions \(\hat{y}_i\) and coefficients \(\hat{\beta}_j\).

Bootstrap Mechanism

Simulate new datasets by resampling the original data with replacement. For each bootstrap iteration \(b = 1, \dots, B\):

  1. Draw \(n\) samples with replacement from \(\{(y_i, x_i)\}_{i=1}^n\) to form

\[\begin{align*} \mathbf{y}^{b} = \begin{pmatrix}y_{1}^{b} \\ \vdots \\ y_{n}^{b}\end{pmatrix}, \quad \mathbf{X}^{b} = \begin{pmatrix}x_{1}^{b} \\ \vdots \\ x_{n}^{b}\end{pmatrix} \end{align*}\]

  2. Refit the model on \((\mathbf{y}^{b}, \mathbf{X}^{b})\) to obtain \(\hat{\beta}^{b}\) and \(\hat{y}_i^{b}\).

Bootstrap Analysis

  • Distribution of \(\{\hat{\beta}_{j}^{b}\}_{b=1}^B\) reveals stability of coefficient \(j\)
  • Distribution of \(\{\hat{y}_{i}^{b}\}_{b=1}^B\) reveals stability of prediction for observation \(i\)
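
The resample, refit, summarize loop can be sketched as follows. This is a minimal illustration, assuming glmnet is installed; the simulated data, \(B = 50\), and the fixed value `lam` are assumptions chosen to keep the example short.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 100; p <- 20; B <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))  # hypothetical signal
lam <- 0.2  # hypothetical fixed lambda; in practice chosen by cross-validation

beta_boot <- matrix(NA_real_, B, p)   # row b holds beta_hat^b
y_hat_boot <- matrix(NA_real_, B, n)  # row b holds y_hat^b for the original samples
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)              # draw n rows with replacement
  fit_b <- glmnet(X[idx, ], y[idx], alpha = 1)  # refit on the bootstrap dataset
  beta_boot[b, ] <- as.matrix(coef(fit_b, s = lam))[-1, 1]  # drop the intercept
  y_hat_boot[b, ] <- drop(predict(fit_b, newx = X, s = lam))
}

selection_freq <- colMeans(beta_boot != 0)  # how often each feature is selected
beta_interval <- apply(beta_boot, 2, quantile, probs = c(0.05, 0.95))
y_hat_interval <- apply(y_hat_boot, 2, quantile, probs = c(0.05, 0.95))
round(selection_freq[1:5], 2)
```

Holding \(\lambda\) fixed across iterations is a simplification. Re-tuning \(\lambda\) inside each iteration would also capture the variability that comes from hyperparameter selection, at a higher computational cost.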

Prediction Stability Plots

  • Centered on diagonal: Lower bias
  • Shorter segments: Lower variance

Case Study

Case study formulation

  • Response (\(y\)): Cell viability after treatment with the drug Ibrutinib.

  • Predictors (\(X\)): High-dimensional molecular profiles, with \(N = 121\) samples and \(J = 9553\) features.

  • Goal: Identify a sparse set of features that distinguish drug sensitive vs. resistant cells.

Review Code Example

Model fitting

We use the glmnet package in R:

library(glmnet)
# Fit the lasso path and perform 10-fold CV
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10, standardize = TRUE)

This solves the optimization problem for a sequence of \(\lambda\) values and estimates cross-validation error across \(K = 10\) folds.

  • alpha = 1 is lasso
  • alpha = 0 is ridge
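
One way to see the difference between the two settings is to count exact zeros at a common \(\lambda\). The sketch below uses simulated data and a value `lam = 0.5`; both are assumptions for illustration, and the value was picked so that it should fall inside the default \(\lambda\) range of both fits.

```r
library(glmnet)  # assumed to be installed

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))  # hypothetical signal

lasso <- glmnet(X, y, alpha = 1)
ridge <- glmnet(X, y, alpha = 0)

lam <- 0.5  # hypothetical value, for illustration only
beta_lasso <- as.matrix(coef(lasso, s = lam))[-1, 1]  # drop the intercept
beta_ridge <- as.matrix(coef(ridge, s = lam))[-1, 1]

# The lasso sets many coefficients exactly to zero; ridge only shrinks them.
c(lasso_zeros = sum(beta_lasso == 0), ridge_zeros = sum(beta_ridge == 0))
```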

Discussion

Respond to [Evaluation Critique] in the exercise sheet.

Case Study

Data preparation

  • It’s important to standardize each predictor before fitting: \(x_{ij} \to \frac{x_{ij} - \bar{x}_j}{\hat{\sigma}_j}\).

  • The penalty \(\lambda \sum |\beta_j|\) treats all \(\beta_j\) equally. If \(x_j\) has a large scale, its \(\beta_j\) will be small, making it “cheaper” for the penalty to keep it in the model.
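
The transformation can be applied with base R's `scale()`. This is a minimal sketch on two made-up predictors with very different scales; the column names and distributions are assumptions.

```r
set.seed(1)
n <- 50
# Hypothetical predictors measured on very different scales.
X <- cbind(small = rnorm(n, mean = 5, sd = 0.1),
           large = rnorm(n, mean = 2000, sd = 500))

# (x_ij - mean_j) / sd_j, applied column by column
X_std <- scale(X, center = TRUE, scale = TRUE)

round(colMeans(X_std), 10)  # both columns are centered at 0
apply(X_std, 2, sd)         # both columns have standard deviation 1
```

With `standardize = TRUE`, glmnet performs a comparable rescaling internally and reports coefficients on the original scale, so the explicit step is mainly useful when we want to inspect the standardized features ourselves.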

Coefficient Paths

  • We can visualize how the lasso coefficients \(\hat{\beta}_j\) evolve as \(\lambda\) decreases.

  • Each line represents a gene (RNA or methylation feature). The order in which they “emerge” from zero indicates their relative importance in predicting drug response.

In-Class

  1. Show how to render quarto files.

  2. Check that coefficients stay at exactly 0 for many \(\lambda\).

  3. Compare results from lasso vs. ridge regression.

Instability

  • When predictors are highly correlated, the \(\hat{\beta}\) estimated by lasso becomes unstable.

  • Small changes in the training data can cause the lasso to switch between two highly correlated genes, leading to conflicting interpretations of the same underlying signal.

Simulation setup

We simulated data where \(\text{Cor}\left(x_{2}, x_{3}\right) > 0.95\). The true coefficients are

  • \(\beta_{1} = \beta_{2} = 2\),
  • \(\beta_{5} = 1.5\),
  • all other \(\beta_{j} = 0\).
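
A dataset matching this description could be generated as follows. This is a sketch under stated assumptions, not the linked course code: the sample size, the number of features, the noise level, and the way the correlated pair is constructed are all choices made here for illustration.

```r
set.seed(1)
n <- 100; p <- 10  # assumed sizes, not taken from the linked code
X <- matrix(rnorm(n * p), n, p)
X[, 3] <- X[, 2] + 0.1 * rnorm(n)  # makes Cor(x2, x3) roughly 0.995

beta <- numeric(p)
beta[c(1, 2)] <- 2  # beta_1 = beta_2 = 2
beta[5] <- 1.5      # beta_5 = 1.5; all other coefficients stay 0
y <- drop(X %*% beta + rnorm(n))  # unit-variance noise is an assumption

cor(X[, 2], X[, 3])
```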

Code: https://go.wisc.edu/m11v16

Simulation results

These are the feature selection frequencies across 1000 bootstrap iterations. Genes 1 and 2 compete for selection.

Takeaway

  • Selection \(\neq\) Importance

    • Just because a gene was not selected by the lasso does not mean it is biologically irrelevant. It could be correlated with a gene that’s part of the true biological mechanism.
  • More generally, we should be careful applying sparse models to correlated data. There may be many plausible, equally predictive explanations based on different subsets of features.

Interpretability vs. accuracy trade-offs

  • Often, people assume that to get higher accuracy, we need “black-box” models (like deep learning or Random Forests) and sacrifice interpretability.

  • But in many scientific problems (like today’s case study; see also “Benchmarks - Open Problems in Single-Cell Analysis,” Openproblems.bio), sparse linear models are surprisingly competitive while remaining interpretable to a wider audience. Whether this holds depends on the true relationship between predictors and response in the data.

Exercise

Respond to [Code Analysis] in the exercise sheet.

“Benchmarks - Open Problems in Single-Cell Analysis.” Openproblems.bio. https://openproblems.bio/benchmarks.